In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(1789)

from IPython.core.display import HTML
def css_styling():
    styles = open("styles/custom.css", "r").read()
    return HTML(styles)
css_styling()


Out[1]:

Statistical Data Modeling

Pandas, NumPy and SciPy provide the core functionality for building statistical models of our data. We use models to:

  • Concisely describe the components of our data
  • Provide inference about underlying parameters that may have generated the data
  • Make predictions about unobserved data, or expected future observations.

This section of the tutorial illustrates how to use Python to build statistical models of low to moderate difficulty from scratch, and use them to extract estimates and associated measures of uncertainty.

Estimation

A recurring statistical problem is finding estimates of the relevant parameters of the distribution that best represents our data.

In parametric inference, we specify a priori a suitable distribution, then choose the parameters that best fit the data.

  • e.g. the mean $\mu$ and the variance $\sigma^2$ in the case of the normal distribution

In [2]:
x = np.array([ 1.00201077,  1.58251956,  0.94515919,  6.48778002,  1.47764604,
        5.18847071,  4.21988095,  2.85971522,  3.40044437,  3.74907745,
        1.18065796,  3.74748775,  3.27328568,  3.19374927,  8.0726155 ,
        0.90326139,  2.34460034,  2.14199217,  3.27446744,  3.58872357,
        1.20611533,  2.16594393,  5.56610242,  4.66479977,  2.3573932 ])
_ = plt.hist(x, bins=7)


Fitting data to probability distributions

We start with the problem of finding values for the parameters that provide the best fit between the model and the data, called point estimates. First, we need to define what we mean by ‘best fit’. There are two commonly used criteria:

  • Method of moments chooses the parameters so that the sample moments (typically the sample mean and variance) match the theoretical moments of our chosen distribution.
  • Maximum likelihood chooses the parameters to maximize the likelihood, which measures how likely it is to observe our given sample.

Discrete Random Variables

$$X = \{0,1\}$$

$$Y = \{\ldots,-2,-1,0,1,2,\ldots\}$$

Probability Mass Function:

For discrete $X$,

$$Pr(X=x) = f(x|\theta)$$

e.g. Poisson distribution

The Poisson distribution models unbounded counts:

$$Pr(X=x)=\frac{e^{-\lambda}\lambda^x}{x!}$$
  • $X=\{0,1,2,\ldots\}$
  • $\lambda > 0$
$$E(X) = \text{Var}(X) = \lambda$$
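
As a quick check (a sketch, with an arbitrary $\lambda = 4$), we can evaluate the PMF with scipy.stats.poisson and confirm that the mean and variance of a large simulated sample are both close to $\lambda$:

from scipy.stats import poisson

lam = 4
# Pr(X = 0), ..., Pr(X = 4) under a Poisson(4)
print(poisson.pmf(np.arange(5), lam))

# For a large sample, the sample mean and variance should both be near lambda
sample = np.random.poisson(lam, size=100000)
print(sample.mean(), sample.var())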

Continuous Random Variables

$$X \in [0,1]$$

$$Y \in (-\infty, \infty)$$

Probability Density Function:

For continuous $X$,

$$Pr(x \le X \le x + dx) = f(x|\theta)dx \, \text{ as } \, dx \rightarrow 0$$

e.g. normal distribution

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$$
  • $X \in \mathbf{R}$
  • $\mu \in \mathbf{R}$
  • $\sigma>0$
$$\begin{align}E(X) &= \mu \cr \text{Var}(X) &= \sigma^2 \end{align}$$
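
As a quick check of the formula (a sketch; the values of $\mu$, $\sigma$ and the evaluation point are arbitrary), a direct evaluation of the density should agree with scipy.stats.norm.pdf:

from scipy.stats import norm

mu, sigma = 1., 2.
xval = 0.5

# Density evaluated directly from the formula above
by_hand = np.exp(-(xval - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
# SciPy parameterizes the normal by loc (mean) and scale (standard deviation)
print(by_hand, norm.pdf(xval, loc=mu, scale=sigma))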

Example: Nashville Precipitation

The dataset nashville_precip.txt contains NOAA precipitation data for Nashville measured since 1871.

The gamma distribution is often a good fit to aggregated rainfall data, and will be our candidate distribution in this case.


In [3]:
precip = pd.read_table("../data/nashville_precip.txt", index_col=0, na_values='NA', delim_whitespace=True)
precip.head()


Out[3]:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Year
1871 2.76 4.58 5.01 4.13 3.30 2.98 1.58 2.36 0.95 1.31 2.13 1.65
1872 2.32 2.11 3.14 5.91 3.09 5.17 6.10 1.65 4.50 1.58 2.25 2.38
1873 2.96 7.14 4.11 3.59 6.31 4.20 4.63 2.36 1.81 4.28 4.36 5.94
1874 5.22 9.23 5.36 11.84 1.49 2.87 2.65 3.52 3.12 2.63 6.12 4.19
1875 6.15 3.06 8.14 4.22 1.73 5.63 8.12 1.60 3.79 1.25 5.46 4.30

In [4]:
_ = precip.hist(sharex=True, sharey=True, grid=False)
plt.tight_layout()


The first step is recognizing what sort of distribution to fit our data to. A couple of observations:

  1. The data are skewed, with a longer tail to the right than to the left
  2. The data are positive-valued, since they are measuring rainfall
  3. The data are continuous

There are a few possible choices, but one suitable alternative is the gamma distribution:

$$x \sim \text{Gamma}(\alpha, \beta) = \frac{\beta^{\alpha}x^{\alpha-1}e^{-\beta x}}{\Gamma(\alpha)}$$

The method of moments simply assigns the empirical mean and variance to their theoretical counterparts, so that we can solve for the parameters.

For the gamma distribution, treating $\beta$ as a scale parameter (so the rate appearing in the density above is $1/\beta$; this is also how scipy.stats.gamma is parameterized), the mean and variance are:

$$ \hat{\mu} = \bar{X} = \alpha \beta $$ $$ \hat{\sigma}^2 = S^2 = \alpha \beta^2 $$

Solving these two equations for the parameters gives the moment estimators, which we can then use to describe our data with a gamma distribution:

$$ \alpha = \frac{\bar{X}^2}{S^2}, \, \beta = \frac{S^2}{\bar{X}} $$
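
Before applying this to the rainfall data, it is worth checking that these estimators recover known parameters from simulated data (a sketch; note that np.random.gamma is parameterized by shape and scale, matching the $\alpha$ and $\beta$ used here):

# Simulate from a gamma with known shape (alpha) and scale (beta)
true_alpha, true_beta = 3., 1.5
sim = np.random.gamma(true_alpha, true_beta, size=100000)

alpha_hat = sim.mean() ** 2 / sim.var()
beta_hat = sim.var() / sim.mean()
print(alpha_hat, beta_hat)   # both should be close to (3, 1.5)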

Let's deal with the missing value in the October data. Given what we are trying to do, it is most sensible to fill in the missing value with the average of the available values. We will learn more sophisticated methods for handling missing data later in the course.


In [5]:
precip.fillna(value={'Oct': precip.Oct.mean()}, inplace=True)


Out[5]:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Year
1871 2.76 4.58 5.01 4.13 3.30 2.98 1.58 2.36 0.95 1.31 2.13 1.65
1872 2.32 2.11 3.14 5.91 3.09 5.17 6.10 1.65 4.50 1.58 2.25 2.38
1873 2.96 7.14 4.11 3.59 6.31 4.20 4.63 2.36 1.81 4.28 4.36 5.94
1874 5.22 9.23 5.36 11.84 1.49 2.87 2.65 3.52 3.12 2.63 6.12 4.19
1875 6.15 3.06 8.14 4.22 1.73 5.63 8.12 1.60 3.79 1.25 5.46 4.30
1876 6.41 2.22 5.28 3.62 3.40 5.65 7.15 5.77 2.52 2.68 1.26 0.95
1877 4.05 1.06 4.98 9.47 1.25 6.02 3.25 4.16 5.40 2.61 4.93 2.49
1878 3.34 2.10 3.48 6.88 2.33 3.28 9.43 5.02 1.28 2.17 3.20 6.04
1879 6.32 3.13 3.81 2.88 2.88 2.50 8.47 4.62 5.18 2.90 5.85 9.15
1880 3.74 12.37 8.16 5.26 4.13 3.97 5.69 2.22 5.39 7.24 5.77 3.32
1881 3.54 5.48 2.79 5.12 3.67 3.70 0.86 1.81 6.57 4.80 4.89 4.85
1882 14.51 8.61 9.38 3.59 7.38 2.54 4.06 5.54 1.61 1.11 3.60 1.52
1883 3.76 7.90 3.98 9.12 4.82 3.82 4.94 4.47 2.23 5.27 3.11 4.97
1884 7.20 8.18 8.89 3.51 3.58 6.53 3.18 2.81 2.36 2.43 1.57 3.78
1885 6.29 2.00 2.33 3.75 4.36 3.72 5.26 1.02 5.60 2.99 2.73 2.90
1886 5.18 3.82 4.76 2.36 2.10 7.69 1.90 5.50 3.68 0.51 5.76 1.48
1887 5.13 8.47 3.36 2.67 3.43 2.31 3.77 2.89 6.85 1.92 2.29 5.31
1888 6.29 3.78 6.46 4.18 2.97 4.68 2.36 7.03 3.82 2.82 4.33 1.77
1889 3.83 1.84 2.47 2.83 5.30 5.33 2.74 1.57 6.81 1.54 6.88 1.17
1890 8.10 10.95 8.64 3.84 4.16 2.23 0.46 6.59 5.86 3.01 2.01 4.12
1891 6.15 6.96 10.31 2.24 2.39 6.50 1.49 3.72 1.25 0.84 6.71 4.26
1892 2.81 2.73 4.10 7.45 4.03 5.01 5.13 3.39 4.78 0.25 3.91 6.43
1893 1.27 4.88 3.37 4.11 7.31 4.74 2.12 1.92 6.43 3.68 2.97 3.50
1894 4.28 8.65 2.69 4.05 2.53 3.55 5.45 2.43 3.07 0.53 1.92 2.81
1895 5.71 0.98 5.09 3.07 2.05 2.90 7.14 1.40 6.69 1.57 2.14 4.09
1896 1.37 3.65 6.45 2.92 4.05 1.82 7.33 1.40 2.74 0.98 5.71 1.79
1897 3.13 3.84 8.49 5.79 1.22 1.82 8.53 2.34 0.19 0.92 2.83 4.93
1898 9.46 0.63 5.36 3.16 1.80 4.97 4.50 6.56 4.87 3.21 3.09 2.41
1899 5.59 5.19 7.81 3.25 3.36 0.75 6.44 2.53 1.50 1.83 1.55 4.64
1900 2.61 3.80 2.20 4.04 1.86 10.35 2.87 1.24 4.55 3.93 8.87 2.22
... ... ... ... ... ... ... ... ... ... ... ... ...
1982 6.50 4.80 3.00 4.36 4.19 2.28 5.47 3.46 3.23 1.91 3.87 6.36
1983 2.56 2.93 3.44 6.80 11.04 3.93 1.71 1.36 0.45 2.77 6.98 7.75
1984 1.79 2.38 5.14 8.41 9.68 4.49 6.63 2.42 0.97 6.00 6.20 2.38
1985 3.02 3.30 2.70 2.91 2.65 1.53 2.00 3.91 2.52 1.59 3.81 0.98
1986 0.19 3.59 2.29 0.52 3.36 2.38 0.77 3.38 2.19 2.19 7.43 3.31
1987 1.61 4.87 1.18 1.03 4.41 2.82 2.56 0.73 1.95 0.21 3.40 5.46
1988 3.73 2.02 2.18 2.09 1.86 0.45 3.26 2.39 2.45 1.54 5.49 3.95
1989 4.52 9.36 5.31 2.68 4.61 7.87 3.18 3.67 6.30 3.62 3.94 1.97
1990 2.76 4.73 3.26 1.60 2.80 2.37 4.86 3.12 2.13 4.41 4.29 10.76
1991 2.92 5.44 4.25 3.35 5.63 1.25 2.82 1.79 5.47 3.88 2.87 7.27
1992 2.97 2.60 4.50 0.77 3.12 4.31 5.89 3.25 3.45 1.62 4.48 2.88
1993 2.76 3.33 5.50 3.33 4.50 5.31 3.64 1.76 2.90 2.20 2.53 6.62
1994 4.36 6.18 7.56 5.72 3.76 8.08 4.82 5.05 4.20 3.31 4.04 2.69
1995 5.61 1.81 3.87 3.95 7.66 3.69 1.95 3.40 5.00 5.60 3.98 2.32
1996 3.82 2.46 5.15 3.68 4.48 3.68 5.45 1.09 4.88 3.16 6.00 4.77
1997 4.19 3.10 9.64 2.42 4.92 6.66 3.26 3.52 5.75 2.71 6.59 2.19
1998 3.68 4.11 3.13 6.31 4.46 11.95 4.63 2.93 1.39 1.59 1.30 6.53
1999 9.28 2.33 4.27 2.29 4.35 3.56 3.19 3.05 1.97 2.04 2.99 2.50
2000 3.52 3.75 3.34 6.23 7.66 1.74 2.25 1.95 1.90 0.26 6.39 3.44
2001 3.21 8.54 2.73 2.42 5.54 4.47 2.77 4.07 1.79 4.61 5.09 3.32
2002 4.93 1.99 9.40 4.31 3.98 3.76 5.64 3.13 6.29 4.48 2.91 5.81
2003 1.59 8.47 2.30 4.69 10.73 7.08 2.87 3.88 8.70 1.80 4.17 3.19
2004 3.60 5.77 4.81 6.69 6.90 3.39 3.19 4.24 4.55 4.90 5.21 5.93
2005 4.42 3.84 3.90 6.93 1.03 2.70 2.39 6.89 1.44 0.02 3.29 2.46
2006 6.57 2.69 2.90 4.14 4.95 2.19 2.64 5.20 4.00 2.98 4.05 3.41
2007 3.32 1.84 2.26 2.75 3.30 2.37 1.47 1.38 1.99 4.95 6.20 3.83
2008 4.76 2.53 5.56 7.20 5.54 2.21 4.32 1.67 0.88 5.03 1.75 6.72
2009 4.59 2.85 2.92 4.13 8.45 4.53 6.03 2.14 11.08 6.49 0.67 3.99
2010 4.13 2.77 3.52 3.48 16.43 4.96 5.86 6.99 1.17 2.49 5.41 1.87
2011 2.31 5.54 4.59 7.51 4.38 5.04 3.46 1.78 6.20 0.93 6.15 4.25

141 rows × 12 columns

Now, let's calculate the sample moments of interest, the means and variances by month:


In [6]:
precip_mean = precip.mean()
precip_mean


Out[6]:
Jan    4.523688
Feb    4.097801
Mar    4.977589
Apr    4.204468
May    4.325674
Jun    3.873475
Jul    3.895461
Aug    3.367305
Sep    3.377660
Oct    2.610500
Nov    3.685887
Dec    4.176241
dtype: float64

In [7]:
precip_var = precip.var()
precip_var


Out[7]:
Jan    6.928862
Feb    5.516660
Mar    5.365444
Apr    4.117096
May    5.306409
Jun    5.033206
Jul    3.777012
Aug    3.779876
Sep    4.940099
Oct    2.741659
Nov    3.679274
Dec    5.418022
dtype: float64

We then use these moments to estimate $\alpha$ and $\beta$ for each month:


In [8]:
alpha_mom = precip_mean ** 2 / precip_var
beta_mom = precip_var / precip_mean

In [9]:
alpha_mom, beta_mom


Out[9]:
(Jan    2.953407
 Feb    3.043866
 Mar    4.617770
 Apr    4.293694
 May    3.526199
 Jun    2.980965
 Jul    4.017624
 Aug    2.999766
 Sep    2.309383
 Oct    2.485616
 Nov    3.692511
 Dec    3.219070
 dtype: float64, Jan    1.531684
 Feb    1.346249
 Mar    1.077920
 Apr    0.979219
 May    1.226724
 Jun    1.299403
 Jul    0.969593
 Aug    1.122522
 Sep    1.462581
 Oct    1.050243
 Nov    0.998206
 Dec    1.297344
 dtype: float64)

We can use the gamma.pdf function in scipy.stats.distributions to plot the distributions implied by the calculated alphas and betas, passing each beta as the scale parameter. For example, here is January:


In [10]:
from scipy.stats.distributions import gamma

precip.Jan.hist(normed=True, bins=20)
# Pass beta explicitly as the scale (the third positional argument of gamma.pdf is loc)
plt.plot(np.linspace(0, 10), gamma.pdf(np.linspace(0, 10), alpha_mom[0], scale=beta_mom[0]))


Out[10]:
[<matplotlib.lines.Line2D at 0x10ad679b0>]

Looping over all months, we can create a grid of plots for the distribution of rainfall, using the gamma distribution:


In [11]:
axs = precip.hist(normed=True, figsize=(12, 8), sharex=True, sharey=True, bins=15, grid=False)

for ax in axs.ravel():
    
    # Get month
    m = ax.get_title()
    
    # Plot fitted distribution
    x = np.linspace(*ax.get_xlim())
    ax.plot(x, gamma.pdf(x, alpha_mom[m], scale=beta_mom[m]))
    
    # Annotate with parameter estimates
    label = 'alpha = {0:.2f}\nbeta = {1:.2f}'.format(alpha_mom[m], beta_mom[m])
    ax.annotate(label, xy=(10, 0.2))
    
plt.tight_layout()


Maximum Likelihood

Maximum likelihood (ML) fitting is usually more work than the method of moments, but it is preferred as the resulting estimator is known to have good theoretical properties.

There is a ton of theory regarding ML. We will restrict ourselves to the mechanics here.

Say we have some data $y = y_1,y_2,\ldots,y_n$ that is distributed according to some distribution:

$$Pr(Y_i=y_i | \theta)$$

Here, for example, is a sample drawn from a Poisson distribution, which describes discrete variables such as counts:


In [12]:
y = np.random.poisson(5, size=100)
plt.hist(y, bins=12, normed=True)
plt.xlabel('y'); plt.ylabel('Pr(y)')


Out[12]:
<matplotlib.text.Text at 0x10d4bd748>

The product $\prod_{i=1}^n Pr(y_i | \theta)$ gives us a measure of how likely it is to observe values $y_1,\ldots,y_n$ given the parameters $\theta$.

Maximum likelihood fitting consists of choosing the appropriate function $l= Pr(Y|\theta)$ to maximize for a given set of observations. We call this function the likelihood function, because it is a measure of how likely the observations are if the model is true.

Given these data, how likely is this model?

In the above model, the data were drawn from a Poisson distribution with parameter $\lambda =5$.

$$L(y|\lambda=5) = \frac{e^{-5} 5^y}{y!}$$

So, for any given value of $y$, we can calculate its likelihood:


In [13]:
poisson_like = lambda x, lam: np.exp(-lam) * (lam**x) / (np.arange(x)+1).prod()

lam = 6
value = 10
poisson_like(value, lam)


Out[13]:
0.041303093412337726

In [14]:
np.sum([poisson_like(yi, lam) for yi in y])


Out[14]:
11.499056250911673

In [15]:
lam = 8
np.sum([poisson_like(yi, lam) for yi in y])


Out[15]:
7.8028592304816868
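
The sums above compare how well each $\lambda$ accounts for the individual observations; the joint likelihood is the product of the terms, so in practice one usually sums log-likelihoods instead, which is numerically more stable. A sketch using scipy.stats.poisson, which exposes logpmf directly:

from scipy.stats import poisson

# Summed log-likelihood of the sample for two candidate values of lambda
for lam in (6, 8):
    print(lam, poisson.logpmf(y, lam).sum())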

We can plot the likelihood function for any value of the parameter(s):


In [16]:
lambdas = np.linspace(0,15)
x = 5
plt.plot(lambdas, [poisson_like(x, l) for l in lambdas])
plt.xlabel('$\lambda$')
plt.ylabel('L($\lambda$|x={0})'.format(x))


Out[16]:
<matplotlib.text.Text at 0x10a60d6a0>

How is the likelihood function different from the probability density (or mass) function? The likelihood is a function of the parameter(s) given the data, whereas the density or mass function gives the probability of the data for a fixed parameter value. Here is the Poisson probability mass function for $\lambda=5$:


In [17]:
lam = 5
xvals = np.arange(15)
plt.bar(xvals, [poisson_like(x, lam) for x in xvals], width=0.2)
plt.xlabel('x')
plt.ylabel('Pr(X|$\lambda$=5)')


Out[17]:
<matplotlib.text.Text at 0x10aa7cd68>

Why are we interested in the likelihood function?

A reasonable estimate of the true, unknown value for the parameter is one which maximizes the likelihood function. So, inference is reduced to an optimization problem.

Going back to the rainfall data, if we are using a gamma distribution we need to maximize:

$$\begin{align}l(\alpha,\beta) &= \sum_{i=1}^n \log[\beta^{\alpha} x_i^{\alpha-1} e^{-\beta x_i}\Gamma(\alpha)^{-1}] \cr &= n[(\alpha-1)\overline{\log(x)} - \bar{x}\beta + \alpha\log(\beta) - \log\Gamma(\alpha)]\end{align}$$

N.B.: It's usually easier to work on the log scale.

where $n = 2012 - 1871 = 141$, the bar indicates an average over all $i$, and $\beta$ here is a rate parameter (the reciprocal of the scale used in the method-of-moments section). We choose $\alpha$ and $\beta$ to maximize $l(\alpha,\beta)$.

Notice that $l$ is undefined if any $x$ is zero (the log term diverges). We do not have any zeros, but we did have an NA value for one of the October observations, which we dealt with above.
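
A quick check (a one-line sketch) confirms that the filled-in table contains no zeros:

# Count exact zeros in the precipitation table; this should be 0
(precip == 0).sum().sum()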

Finding the MLE

To find the maximum of any function, we typically take the derivative with respect to the variable to be maximized, set it to zero and solve for that variable.

$$\frac{\partial l(\alpha,\beta)}{\partial \beta} = n\left(\frac{\alpha}{\beta} - \bar{x}\right) = 0$$

This can be solved as $\beta = \alpha/\bar{x}$. However, plugging this back into the derivative with respect to $\alpha$ yields:

$$\frac{\partial l(\alpha,\beta)}{\partial \alpha} = \log(\alpha) + \overline{\log(x)} - \log(\bar{x}) - \frac{\Gamma(\alpha)'}{\Gamma(\alpha)} = 0$$

This has no closed form solution. We must use numerical optimization!

Numerical optimization algorithms take an initial "guess" at the solution and iteratively improve it until it gets "close enough" to the answer.

Here, we will use the Newton-Raphson method, which is a root-finding algorithm:

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$

which is available to us via SciPy:


In [18]:
from scipy.optimize import newton
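
To see the update rule in action, here is a minimal hand-rolled iteration for a simple problem (a sketch: finding the positive root of $f(x) = x^2 - 2$). scipy.optimize.newton does the same thing with proper convergence checks:

f = lambda x: x**2 - 2
fprime = lambda x: 2*x

xn = 1.0                      # initial guess
for _ in range(6):
    # Newton-Raphson update: x_{n+1} = x_n - f(x_n)/f'(x_n)
    xn = xn - f(xn) / fprime(xn)
print(xn, np.sqrt(2))         # converges to sqrt(2)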

Here is a graphical example of how Newton-Raphson converges on a solution, using an arbitrary function:


In [19]:
%run newton_raphson_plot.py


To apply the Newton-Raphson algorithm, we need the function whose root we seek (here, the first derivative of the log-likelihood) and its derivative (the second derivative of the log-likelihood), both taken with respect to the variable of interest. The second derivative of the gamma log-likelihood with respect to $\alpha$ is:

$$\frac{\partial^2 l(\alpha,\beta)}{\partial \alpha^2} = \frac{1}{\alpha} - \frac{\partial}{\partial \alpha} \left[ \frac{\Gamma(\alpha)'}{\Gamma(\alpha)} \right]$$

In [20]:
from scipy.special import psi, polygamma

dlgamma = lambda a, log_mean, mean_log: np.log(a) - psi(a) - log_mean + mean_log
dl2gamma = lambda a, *args: 1./a - polygamma(1, a)

where log_mean and mean_log are $\log{\bar{x}}$ and $\overline{\log(x)}$, respectively. psi is the digamma function (the first derivative of $\log\Gamma(\alpha)$) and polygamma(1, a) is the trigamma function (its second derivative); these are the special functions that appear when the log-likelihood is differentiated.
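
We can check numerically that psi is indeed the derivative of $\log\Gamma$ (a sketch using a centered finite difference at an arbitrary point):

from scipy.special import gammaln

a, eps = 2.5, 1e-6
# Centered finite difference of log-gamma vs. the digamma function
print((gammaln(a + eps) - gammaln(a - eps)) / (2 * eps), psi(a))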


In [21]:
# Calculate statistics
log_mean = precip.mean().apply(np.log)
mean_log = precip.apply(np.log).mean()

Time to optimize!


In [22]:
# Alpha MLE for December
alpha_mle = newton(dlgamma, 2, dl2gamma, args=(log_mean[-1], mean_log[-1]))
alpha_mle


Out[22]:
3.5189679152399647

And now plug this back into the solution for beta (recall that this $\beta$ is a rate, so the corresponding scale parameter is $1/\beta$):

$$ \beta = \frac{\alpha}{\bar{X}} $$

In [23]:
beta_mle = alpha_mle/precip.mean()[-1]
beta_mle


Out[23]:
0.84261607548413797

We can compare the fit of the estimates derived from MLE to those from the method of moments:


In [24]:
dec = precip.Dec
dec.hist(normed=True, bins=10, grid=False)
x = np.linspace(0, dec.max())
# beta_mom estimates the scale directly; beta_mle is a rate, so its reciprocal is the scale
plt.plot(x, gamma.pdf(x, alpha_mom[-1], scale=beta_mom[-1]), 'm-', label='Moment estimator')
plt.plot(x, gamma.pdf(x, alpha_mle, scale=1./beta_mle), 'r--', label='ML estimator')
plt.legend()


Out[24]:
<matplotlib.legend.Legend at 0x10e079c18>

For some common distributions, SciPy includes methods for fitting via MLE:


In [25]:
from scipy.stats import gamma

gamma.fit(precip.Dec)


Out[25]:
(2.2427517753152308, 0.65494604470188622, 1.570073932063466)

This fit is not directly comparable to our estimates, however, because SciPy's gamma.fit method fits a three-parameter version of the gamma distribution, returning a (shape, location, scale) tuple, and here it has estimated a non-zero location.
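
If we constrain the location to zero (the fit method accepts floc for this), the result is directly comparable to our two-parameter estimates; a sketch:

# Fix the location at zero to get a two-parameter (shape, scale) fit
gamma.fit(precip.Dec, floc=0)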

Model checking

An informal way of checking the fit of our parametric model is to compare the observed quantiles of the data to those of the theoretical model we are fitting it to. If the model is a good fit, the points should fall on a 45-degree reference line. This is called a probability plot.

SciPy includes a probplot function that generates probability plots based on the data and a specified distribution.


In [26]:
from scipy.stats import probplot

# beta_mle is a rate, so pass its reciprocal as the scale
probplot(precip.Dec, dist=gamma(alpha_mle, scale=1./beta_mle), plot=plt);


Example: truncated distribution

Suppose that we observe $Y$ truncated below at $a$ (where $a$ is known). If $X$ is the distribution of our observation, then:

$$ P(X \le x) = P(Y \le x|Y \gt a) = \frac{P(a \lt Y \le x)}{P(Y \gt a)}$$

(so, $Y$ is the original variable and $X$ is the truncated variable)

Then X has the density:

$$f_X(x) = \frac{f_Y(x)}{1 - F_Y(a)} \, \text{for} \, x \gt a$$

Suppose $Y \sim N(\mu, \sigma^2)$ and $x_1,\ldots,x_n$ are independent observations of $X$. We can use maximum likelihood to find $\mu$ and $\sigma$.

First, we can simulate a truncated distribution using a while statement to eliminate samples that are outside the support of the truncated distribution.


In [27]:
x = np.random.normal(size=10000)

# Truncation point
a = -1

# Resample until all points meet criterion
x_small = x < a
while x_small.sum():
    x[x_small] = np.random.normal(size=x_small.sum())
    x_small = x < a
    
_ = plt.hist(x, bins=100)


We can construct a log-likelihood for this model using the truncated (conditional) density:

$$f_X(x) = \frac{f_Y(x)}{1 - F_Y(a)} \, \text{for} \, x \gt a$$

The denominator normalizes the truncated distribution so that it integrates to one.
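
We can confirm this normalization numerically for a standard normal truncated at $a = -1$ (a sketch using scipy.integrate.quad):

from scipy.integrate import quad
from scipy.stats import norm

a = -1
# Integrate the truncated density over its support; the result should be ~1
quad(lambda v: norm.pdf(v) / (1 - norm.cdf(a)), a, np.inf)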


In [28]:
from scipy.stats.distributions import norm

# Negative log-likelihood of the truncated normal (negated so that fmin can minimize it)
trunc_norm = lambda theta, a, x: -(np.log(norm.pdf(x, theta[0], theta[1])) - 
                                      np.log(1 - norm.cdf(a, theta[0], theta[1]))).sum()

For this example, we will use another optimization routine, the Nelder-Mead simplex algorithm. It has a couple of advantages:

  • it does not require derivatives
  • it can optimize (minimize) a vector of parameters

SciPy implements this algorithm in its fmin function:


In [29]:
from scipy.optimize import fmin

fmin(trunc_norm, np.array([1,2]), args=(-1, x))


Optimization terminated successfully.
         Current function value: 11084.885149
         Iterations: 47
         Function evaluations: 88
Out[29]:
array([ 0.00531007,  1.00652051])
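
The recovered values are close to the true $\mu = 0$ and $\sigma = 1$. For comparison, the naive sample statistics of the truncated sample are biased by the truncation (a quick check):

# Ignoring the truncation overstates the mean and understates the spread
print(x.mean(), x.std())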

In general, simulating data is a terrific way of testing your model before using it with real data.

Kernel density estimates

In some instances, we may not be interested in the parameters of a particular distribution, but just in a smoothed representation of the data at hand. In this case, we can estimate the distribution non-parametrically (i.e. making no assumptions about the form of the underlying distribution) using kernel density estimation.


In [30]:
# Some random data
y = np.random.normal(10, size=15)
y


Out[30]:
array([  9.68444175,   6.86641455,   8.90236824,   8.48662651,
         9.4599326 ,   9.83248454,   9.45367613,  12.95585035,
        10.58747989,  10.94829904,  10.48694903,   9.13200438,
         9.73979452,  10.14852737,   8.62582257])

The kernel estimator is a sum of symmetric densities centered at each observation. The selected kernel function determines the shape of each component while the bandwidth determines their spread. For example, if we use a Gaussian kernel function, the variance acts as the bandwidth.


In [31]:
x = np.linspace(7, 13, 100)
# Smoothing parameter
s = 0.3
# Calculate the kernels
kernels = np.transpose([norm.pdf(x, yi, s) for yi in y])
plt.plot(x, kernels, 'k:')
plt.plot(x, kernels.sum(1))
plt.plot(y, np.zeros(len(y)), 'ro', ms=10)


Out[31]:
[<matplotlib.lines.Line2D at 0x10e27f710>]

SciPy implements a Gaussian KDE that automatically chooses an appropriate bandwidth. Let's create a bi-modal distribution of data that is not easily summarized by a parametric distribution:


In [32]:
# Create a bi-modal distribution with a mixture of Normals.
x1 = np.random.normal(0, 2, 50)
x2 = np.random.normal(5, 1, 50)

# Append by row
x = np.r_[x1, x2]

In [33]:
plt.hist(x, bins=10, normed=True)


Out[33]:
(array([ 0.02606255,  0.04343758,  0.12162523,  0.09556268,  0.08687517,
         0.0521251 ,  0.04343758,  0.21718792,  0.13900027,  0.04343758]),
 array([-3.97183054, -2.82075361, -1.66967669, -0.51859976,  0.63247717,
         1.78355409,  2.93463102,  4.08570795,  5.23678488,  6.3878618 ,
         7.53893873]),
 <a list of 10 Patch objects>)

In [34]:
from scipy.stats import kde

density = kde.gaussian_kde(x)
xgrid = np.linspace(x.min(), x.max(), 100)
plt.hist(x, bins=8, normed=True)
plt.plot(xgrid, density(xgrid), 'r-')


Out[34]:
[<matplotlib.lines.Line2D at 0x10e26b780>]
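
By default gaussian_kde selects the bandwidth with Scott's rule; the bw_method argument lets us override it if the automatic choice over- or under-smooths (a sketch, with an arbitrary factor of 0.2):

# A smaller bandwidth factor gives a wigglier estimate, a larger one a smoother curve
density_narrow = kde.gaussian_kde(x, bw_method=0.2)
plt.plot(xgrid, density_narrow(xgrid), 'g--')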

Exercise: Comparative Chopstick Effectiveness

A few researchers set out to determine the optimal length for chopsticks. The dataset chopstick-effectiveness.csv includes measurements of "Food Pinching Efficiency" across a range of chopstick lengths for 31 individuals.

Use the method of moments or MLE to estimate the mean and variance of food pinching efficiency for each chopstick length. This means you will need to select an appropriate distributional form for these data.


In [35]:
# Write your answer here